TutoRial - Part 2

Marine Ecosystem Dynamics - 2025

Author

Kinlan M.G. Jan

Pipes

Pipes, written as %>% (from the magrittr package) or |> (built into R since version 4.1), make code easier to read. With pipes, the data flow from one function to the next, from left to right.
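As a minimal sketch of what a pipe does: the value on the left becomes the first argument of the call on the right, so x |> f(y) is evaluated as f(x, y). The vector values below is a made-up example, not part of the course data.

```r
# Hypothetical example vector
values <- c(1.234, 5.678)

# Nested call: read from the inside out
nested <- round(mean(values), digits = 1)

# Piped call: read from left to right
piped <- values |> mean() |> round(digits = 1)

# Both forms give the same result
identical(nested, piped)
```

The same chain works with %>% once a package that provides it (for example magrittr or dplyr) is loaded.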

Exercises

  • Rewrite these chunks of code using pipes
sum(c(2.2,4.1,2,pi))
c(2.2,4.1,2,pi) |> sum()
# OR
c(2.2,4.1,2,pi) %>% sum()
round(sum(c(2.2,4.1,2,pi)))
c(2.2,4.1,2,pi) |> sum() |> round()
# OR
c(2.2,4.1,2,pi) %>% sum() %>% round()
round(sum(c(2.2,4.1,2,pi)), digits = 3)
c(2.2,4.1,2,pi) |> sum() |> round(digits = 3)
# OR
c(2.2,4.1,2,pi) %>% sum() %>% round(digits = 3)

Tidy the data with tidyr

As seen in the slides, a tidy table has:

  1. Each variable in its own column
  2. Each observation in its own row

To reach this, tidyr has 4 key functions:

  1. pivot_longer
  2. pivot_wider
  3. unite
  4. separate
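The four functions come in pairs that undo each other. The sketch below illustrates this on a small made-up table (the object names wide, long, united and split_again are hypothetical and not part of the exercises).

```r
library(tidyr)

# Hypothetical toy table: one row per station, one column per taxon
wide <- tibble::tibble(
  Station = c("A", "B"),
  Acartia = c(1.5, 2.0),
  Temora  = c(3.0, 4.5)
)

# pivot_longer: gather the taxon columns into a name column and a value column
long <- wide |>
  pivot_longer(cols = c(Acartia, Temora),
               names_to = "Taxa", values_to = "Biomass")

# pivot_wider: spread them back out into one column per taxon
wide_again <- long |>
  pivot_wider(names_from = Taxa, values_from = Biomass)

# unite: paste two columns together; separate: split them again
united <- long |>
  unite("Station_Taxa", Station, Taxa, sep = "_")
split_again <- united |>
  separate(Station_Taxa, into = c("Station", "Taxa"), sep = "_")
```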

Exercises

  • If you have not done so yet, download the dataset zooplankton_seasonality.csv
  • Import the dataset into your environment

  • Is this dataset a tidy dataset?

First 6 rows of the dataset zooplankton_seasonality
Month_abb Year Station Coordinates Group Taxa Biomass
Jan 2009 BY15 20.05000/57.33333 Copepoda Acartia 6.650319
Jan 2009 BY31 18.23333/58.58812 Copepoda Acartia 1.816994
Jan 2009 BY5 15.98333/55.25000 Copepoda Acartia 5.562097
Jan 2009 BY15 20.05000/57.33333 Copepoda Centropages 5.738562
Jan 2009 BY31 18.23333/58.58812 Copepoda Centropages 1.228759
Jan 2009 BY5 15.98333/55.25000 Copepoda Centropages 14.405224

Each observation has its own row
Coordinates holds 2 values (longitude and latitude), so not every variable has its own column
The dataset is therefore not yet tidy

  • Separate the column Coordinates into 2 new columns: Longitude and Latitude
library(tidyr)
zooplankton |>
  separate(Coordinates, into = c("Longitude", "Latitude"), sep = "/")
First 6 rows of the transformed dataset zooplankton_seasonality
Month_abb Year Station Longitude Latitude Group Taxa Biomass
Jan 2009 BY15 20.05000 57.33333 Copepoda Acartia 6.650319
Jan 2009 BY31 18.23333 58.58812 Copepoda Acartia 1.816994
Jan 2009 BY5 15.98333 55.25000 Copepoda Acartia 5.562097
Jan 2009 BY15 20.05000 57.33333 Copepoda Centropages 5.738562
Jan 2009 BY31 18.23333 58.58812 Copepoda Centropages 1.228759
Jan 2009 BY5 15.98333 55.25000 Copepoda Centropages 14.405224
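Note that separate() returns text columns by default. As a hedged aside, the sketch below uses a made-up two-row table (coords is a hypothetical name) to show the optional convert = TRUE argument, which asks tidyr to guess a better type for the new columns. The solutions in this tutorial keep the default and convert the columns later with dplyr.

```r
library(tidyr)

# Hypothetical two-row stand-in for the Coordinates column
coords <- tibble::tibble(
  Coordinates = c("20.05000/57.33333", "18.23333/58.58812")
)

# Default: the new columns are character
as_text <- coords |>
  separate(Coordinates, into = c("Longitude", "Latitude"), sep = "/")

# convert = TRUE: the new columns are converted to numeric where possible
as_number <- coords |>
  separate(Coordinates, into = c("Longitude", "Latitude"), sep = "/",
           convert = TRUE)

class(as_text$Longitude)
class(as_number$Longitude)
```

Recent tidyr versions mark separate() as superseded in favour of separate_wider_delim(); separate() still works and is what this tutorial uses.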
  • Combine the columns Group and Taxa into a new column Group_Taxa and save the dataframe as tidy_df
library(tidyr)
tidy_df <-
  zooplankton |>
  separate(Coordinates, into = c("Longitude", "Latitude"), sep = "/") |> 
  unite("Group_Taxa", c(Group, Taxa))
First 6 rows of the transformed dataset zooplankton_seasonality
Month_abb Year Station Longitude Latitude Group_Taxa Biomass
Jan 2009 BY15 20.05000 57.33333 Copepoda_Acartia 6.650319
Jan 2009 BY31 18.23333 58.58812 Copepoda_Acartia 1.816994
Jan 2009 BY5 15.98333 55.25000 Copepoda_Acartia 5.562097
Jan 2009 BY15 20.05000 57.33333 Copepoda_Centropages 5.738562
Jan 2009 BY31 18.23333 58.58812 Copepoda_Centropages 1.228759
Jan 2009 BY5 15.98333 55.25000 Copepoda_Centropages 14.405224
  • Create a wide table with one column of Biomass values per Group_Taxa and save the dataframe as wide_df
library(tidyr)
wide_df <-
  tidy_df |> 
  pivot_wider(names_from = Group_Taxa, values_from = Biomass) 
First 6 rows of the transformed dataset zooplankton_seasonality
Month_abb Year Station Longitude Latitude Copepoda_Acartia Copepoda_Centropages Copepoda_Pseudocalanus Copepoda_Temora Rotatoria_Synchaeta Copepoda_Eurytemora Rotatoria_Keratella Cladocera_Bosmina Cladocera_Evadne Cladocera_Podon
Jan 2009 BY15 20.05000 57.33333 6.650319 5.7385615 10.522882 9.725488 0.3921570 NA NA NA NA NA
Jan 2009 BY31 18.23333 58.58812 1.816994 1.2287586 5.633984 4.993465 0.4705890 NA NA NA NA NA
Jan 2009 BY5 15.98333 55.25000 5.562097 14.4052240 21.594775 45.738529 0.3921570 NA NA NA NA NA
Jan 2010 BY15 20.05000 57.33333 2.467319 0.3071893 13.601301 7.549021 0.1568628 NA NA NA NA NA
Jan 2010 BY31 18.23333 58.58812 2.248367 0.3856208 2.660128 8.418301 0.4117650 0.0849674 NA NA NA NA
Jan 2011 BY15 20.05000 57.33333 5.065367 2.9803908 49.660135 36.431384 0.5490210 NA NA NA NA NA
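The NA values appear because some Group_Taxa have no row for a given month, year and station in the long table. The sketch below reproduces this on made-up data and shows the values_fill argument of pivot_wider(). Whether a missing record should count as zero biomass is an assumption that depends on the sampling, so treat the 0 here as an illustration only.

```r
library(tidyr)

# Hypothetical long table: taxon "B" has no row for station "S2"
long <- tibble::tibble(
  Station = c("S1", "S1", "S2"),
  Taxa    = c("A", "B", "A"),
  Biomass = c(1.2, 0.4, 2.5)
)

# Default: missing combinations become NA
with_na <- long |>
  pivot_wider(names_from = Taxa, values_from = Biomass)

# values_fill replaces those gaps with a chosen value (here 0, an assumption)
with_zero <- long |>
  pivot_wider(names_from = Taxa, values_from = Biomass, values_fill = 0)
```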

Data handling with dplyr

After finishing tidying the data, we often use the dplyr package to process our data.

Exercises

  • What are the classes of the Year, Longitude and Latitude columns of the tidy_df dataframe?
    If they are not numeric, mutate them to numeric values.
str(tidy_df)
#> tibble [2,956 × 7] (S3: tbl_df/tbl/data.frame)
#>  $ Month_abb : chr [1:2956] "Jan" "Jan" "Jan" "Jan" ...
#>  $ Year      : chr [1:2956] "2009" "2009" "2009" "2009" ...
#>  $ Station   : chr [1:2956] "BY15" "BY31" "BY5" "BY15" ...
#>  $ Longitude : chr [1:2956] "20.05000" "18.23333" "15.98333" "20.05000" ...
#>  $ Latitude  : chr [1:2956] "57.33333" "58.58812" "55.25000" "57.33333" ...
#>  $ Group_Taxa: chr [1:2956] "Copepoda_Acartia" "Copepoda_Acartia" "Copepoda_Acartia" "Copepoda_Centropages" ...
#>  $ Biomass   : num [1:2956] 6.65 1.82 5.56 5.74 1.23 ...

Year, Longitude and Latitude are all stored as characters

library(dplyr)
tidy_df |>
  mutate(Year = as.numeric(Year),
         Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude))
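Why the conversion matters can be sketched with a few made-up values (years_chr and years_num are hypothetical names): text is compared character by character, which gives surprising results with operators such as > or <, whereas %in% happens to work on both types because it converts both sides to a common type first.

```r
# Hypothetical years stored as text, as in the str() output
years_chr <- c("2009", "2012", "900")
years_num <- as.numeric(years_chr)

# Text comparison goes character by character, so "900" counts as larger than "2012"
years_chr > "2012"

# Numeric comparison goes by value
years_num > 2012

# %in% converts both sides to a common type, so it matches either way
years_chr %in% 2012:2015
years_num %in% 2012:2015
```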
  • Then, keep all years between 2012 and 2015
library(dplyr)
tidy_df |> 
  mutate(Year = as.numeric(Year),
         Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude)) |>
  filter(Year %in% 2012:2015)
  • Then, only keep the data from the Station BY31
library(dplyr)
tidy_df |> 
  mutate(Year = as.numeric(Year),
         Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude)) |>
  filter(Year %in% 2012:2015) |> 
  filter(Station == "BY31")

# OR

tidy_df |> 
  mutate(Year = as.numeric(Year),
         Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude)) |>
  filter(Year %in% 2012:2015,
         Station == "BY31")
  • Then, select all columns except Longitude and Latitude
library(dplyr)
tidy_df |> 
  mutate(Year = as.numeric(Year),
         Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude)) |>
  filter(Year %in% 2012:2015,
         Station == "BY31") |> 
  select(-Longitude,
         -Latitude)
  • Then, rename Month_abb as Month
library(dplyr)
tidy_df |> 
  mutate(Year = as.numeric(Year),
         Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude)) |>
  filter(Year %in% 2012:2015,
         Station == "BY31") |> 
  select(-Longitude,
         -Latitude) |> 
  rename(Month = Month_abb)
  • Then, group by Month and Group_Taxa, compute the average and standard deviation of Biomass, and save the dataframe as summarised_df
library(dplyr)
summarised_df <-
  tidy_df |> 
  mutate(Year = as.numeric(Year),
         Longitude = as.numeric(Longitude),
         Latitude = as.numeric(Latitude)) |>
  filter(Year %in% 2012:2015,
         Station == "BY31") |> 
  select(-Longitude,
         -Latitude) |> 
  rename(Month = Month_abb) |> 
  group_by(Month, Group_Taxa) |> 
  summarise(average = mean(Biomass),
            standard_deviation = sd(Biomass))
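One detail worth knowing: after summarise() on two grouping variables, the result is still grouped by the first one (here Month). The sketch below shows this on made-up data (toy, still_grouped and ungrouped are hypothetical names) and uses the optional .groups = "drop" argument to return an ungrouped table instead.

```r
library(dplyr)

# Hypothetical toy data: two months, one taxon, two observations each
toy <- tibble(
  Month      = c("Jan", "Jan", "Feb", "Feb"),
  Group_Taxa = "Copepoda_Acartia",
  Biomass    = c(1, 3, 2, 6)
)

# Default: only the last grouping variable is dropped, Month stays as a group
still_grouped <- toy |>
  group_by(Month, Group_Taxa) |>
  summarise(average = mean(Biomass),
            standard_deviation = sd(Biomass))

# .groups = "drop" returns an ungrouped table
ungrouped <- toy |>
  group_by(Month, Group_Taxa) |>
  summarise(average = mean(Biomass),
            standard_deviation = sd(Biomass),
            .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

For the plotting steps that follow, either form works; the grouping only matters if further dplyr verbs are applied afterwards.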

Plotting the data with ggplot2

In this part, we will build a plot step by step using the grammar of graphics in ggplot2.

  • Load the package and only keep the values for the copepod Acartia from the summarised_df dataset in a new dataset called acartia
library(ggplot2)
acartia <-
  summarised_df |> 
  filter(Group_Taxa == "Copepoda_Acartia")
  • Initiate a ggplot with the dataset acartia with the Month as the x-axis and the average biomass as the y-axis
ggplot(data = acartia,
       mapping = aes(x = Month, y = average))

  • Add a barplot geometry to the plot
ggplot(data = acartia,
       mapping = aes(x = Month, y = average)) +
  geom_bar(stat = "identity") # <- stat = "identity" uses the y values as bar heights instead of counting rows

  • Arrange the bars from the lowest to the highest value
ggplot(data = acartia,
       mapping = aes(x = reorder(Month, average), y = average)) +
  geom_bar(stat = "identity") # <- this is needed for the barplot geometry

  • Fill the bars with a color according to the Month
ggplot(data = acartia,
       mapping = aes(x = reorder(Month, average), y = average, fill = Month)) +
  geom_bar(stat = "identity") # <- this is needed for the barplot geometry

  • Label the axes Biomass and Month, and add a title
ggplot(data = acartia,
       mapping = aes(x = reorder(Month, average), y = average, fill = Month)) +
  geom_bar(stat = "identity") + # <- this is needed for the barplot geometry
  labs(x = "Month", y = "Biomass", title = "My nice ggplot")
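As an optional variation, the sketch below orders the bars by calendar month instead of by value. It relies on the built-in constant month.abb and on geom_col(), which is a shorthand for geom_bar(stat = "identity"). The data frame toy is a made-up stand-in for acartia, and the values are invented.

```r
library(ggplot2)

# Hypothetical stand-in for the acartia table
toy <- data.frame(
  Month   = c("Mar", "Jan", "Feb"),
  average = c(4.2, 1.1, 2.7)
)

# month.abb is a built-in vector: "Jan", "Feb", ..., "Dec"
# Using it as factor levels makes ggplot2 draw the months in calendar order
toy$Month <- factor(toy$Month, levels = month.abb)

p <- ggplot(data = toy,
            mapping = aes(x = Month, y = average, fill = Month)) +
  geom_col() + # same as geom_bar(stat = "identity")
  labs(x = "Month", y = "Biomass", title = "Bars in calendar order")

p
```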